Skip to content

fix(server): different graph name share the same backend#3027

Merged
imbajin merged 3 commits into
apache:masterfrom
JisoLya:fix-parse-err
May 17, 2026
Merged

fix(server): different graph name share the same backend#3027
imbajin merged 3 commits into
apache:masterfrom
JisoLya:fix-parse-err

Conversation

@JisoLya
Copy link
Copy Markdown
Contributor

@JisoLya JisoLya commented May 15, 2026

Purpose of the PR

During backend initialization, the server creates two graphs — ~sys_graph and hugegraph (default) — that share the same store backend. This leads to a key collision:

  • ~sys_graph writes server node info with id server-1
  • ~sys_graph reads back server-1 successfully
  • hugegraph writes server node info with the same id server-1
  • hugegraph reads from the shared table — retrieves ~sys_graph's record instead → type mismatch → parse error

Main Changes

Change ServerInfoManager.selfNodeId() which returns "server-1" previously to "{graphname}/server-1"

Upgrade impact

This change namespaces the server id by graph name, so old unfinished tasks in the non-PD local scheduler may still reference the previous bare server id, for example server-1.

Those historical tasks can remain visible, but they may not be restored or cancelled by the new namespaced server id after upgrade. The impact is limited to unfinished local-scheduler tasks that already existed before upgrading; newly scheduled tasks use the new namespaced id. To avoid this compatibility edge case, finish or cancel pending local tasks before upgrading.

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows:
    • xxx

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修复方向正确,用图命名空间前缀隔离 server ID 能解决跨图 key 碰撞问题。但存在一个阻塞性逻辑缺陷 + 测试缺失,建议修复后再合并。


‼️ 测试覆盖不足(阻塞)

PR 勾选了"已有测试覆盖",但全量搜索 hugegraph-server/hugegraph-test/ 后,没有任何测试覆盖"两图共享同一 backend + 相同 nodeId"的场景。

建议补充回归测试,至少验证:

  1. ~sys_graphhugegraph 使用相同 nodeId(如 "server-1")初始化时,selfNodeId() 返回值互不相同
  2. 两个图能并发启动,不抛出 parse error 或 "already in cluster" 异常

⚠️ 升级兼容性说明缺失

存量部署 backend 中的 ServerInfo vertex key 格式为 server-1;升级后代码改为以 DEFAULT-hugegraph/server-1 查询,旧记录查不到便重新写入新记录,旧的 server-1 记录以孤儿形式留在 backend 中不被清理。对功能影响有限,但建议在 PR 描述或 Release Notes 中注明,避免运维困惑。


🧹 小建议:spaceGraphName 字符集

spaceGraphName() 的值由用户配置拼接而来,没有对 / 字符的校验。若用户将 graphSpace 配置为含 / 的字符串,生成的 ID 会出现多个 /,可能引发后续解析歧义。建议加一行断言或在文档中说明 graphSpace 不允许包含 /

// Scope server id to graph to avoid cross-graph collision
return IdGenerator.of(this.graph.spaceGraphName() + "/" +
this.globalNodeInfo.nodeId().asString());
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Critical:initServerInfo() 中读写路径 ID 不一致(阻塞性问题)

此处修改后 selfNodeId() 返回命名空间化的 ID(如 DEFAULT-hugegraph/server-1),但调用方 initServerInfo() 的存在性检查仍使用原始裸 ID,两条路径的 key 不同:

// initServerInfo()(本 PR 未修改这部分)
Id serverId = nodeInfo.nodeId();                        // ❌ 裸 ID:"server-1"
HugeServerInfo existed = this.serverInfo(serverId);    // ❌ 用 "server-1" 查询
// ...
this.globalNodeInfo = nodeInfo;                        // line 140,赋值后 selfNodeId() 才变
this.saveServerInfo(this.selfNodeId(), ...);            // ✅ 保存 "DEFAULT-hugegraph/server-1"

后果:

  • lines 109–122 的 alive 检查(防止同名节点重复启动的保护)永远查不到已命名空间化的旧记录,保护逻辑被静默跳过
  • 重启时旧的命名空间记录不会被正确识别,会直接写入新记录,旧记录成为孤儿数据积累在 backend 中

建议修法:initServerInfo() 入口先设置 globalNodeInfo,再统一用 selfNodeId() 做后续查询,保持读写 key 一致:

public synchronized void initServerInfo(GlobalMasterInfo nodeInfo) {
    E.checkArgument(nodeInfo != null, "The global node info can't be null");
    this.globalNodeInfo = nodeInfo;   // 先赋值,selfNodeId() 立即返回命名空间 ID
    Id serverId = this.selfNodeId();  // ✅ 与 save / remove 路径使用同一个 key
    HugeServerInfo existed = this.serverInfo(serverId);
    if (existed != null && existed.alive()) {
        // ... 存活检查 & 等待逻辑不变 ...
    }
    E.checkArgument(existed == null || !existed.alive(),
                    "The server with name '%s' already in cluster", serverId);
    // master 唯一性检查不变 ...
    this.saveServerInfo(serverId, this.selfNodeRole());
}

这样读取和写入始终使用同一个 ID,alive 检查和 master 唯一性保护才能真正生效。

@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label May 15, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.25%. Comparing base (e108076) to head (b4ede79).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3027       +/-   ##
=============================================
+ Coverage     35.85%   93.25%   +57.40%     
+ Complexity      338       65      -273     
=============================================
  Files           802        9      -793     
  Lines         67995      267    -67728     
  Branches       8902       22     -8880     
=============================================
- Hits          24381      249    -24132     
+ Misses        41008        8    -41000     
+ Partials       2606       10     -2596     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels May 16, 2026
Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, the key collision fix now scopes the server id by graph name and the regression test covers the core case. I added the upgrade-impact note for old local-scheduler tasks; the remaining compatibility edge is limited to unfinished non-PD/local tasks that existed before upgrade, so this looks acceptable to merge from my side.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 17, 2026
@imbajin imbajin merged commit 66e5339 into apache:master May 17, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hugegraph-server 启动不正常,查看日志有报错

2 participants